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•^T High resolution microarrays and second-generation sequencing plat- 

forms are powerful tools to investigate genome-wide alterations in DNA 
copy number, methylation and gene expression associated with a dis- 
ease. An integrated genomic profiling approach measures multiple omics 
data types simultaneously in the same set of biological samples. Such 
Q^ ' approach renders an integrated data resolution that would not be avail- 

■^T , able with any single data type. In this study, we use penalized latent 

variable regression methods for joint modeling of multiple omics data 
C$ [ types to identify common latent variables that can be used to cluster 

~tZ. . patient samples into biologically and clinically relevant disease subtypes. 

We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996) 267-288], elastic 
net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301-320] and fused 
lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91-108] methods 
to induce sparsity in the coefficient vectors, revealing important genomic 
<— s features that have significant contributions to the latent variables. An 

.— +. ' iterative ridge regression is used to compute the sparse coefficient vec- 

y/~\ , tors. In model selection, a uniform design [Monographs on Statistics and 

Applied Probability (1994) Chapman & Hall] is used to seek "experimen- 
tal" points that scattered uniformly across the search domain for efficient 
J^, , sampling of tuning parameter combinations. We compared our method 

to sparse singular value decomposition (SVD) and penalized Gaussian 
mixture model (GMM) using both real and simulated data sets. The 
proposed method is applied to integrate genomic, epigenomic and tran- 
K^ ' scriptomic data for subtype analysis in breast and lung cancer data sets. 

a' 

1. Introduction. Clustering analysis is an unsupervised learning method 
that aims to group data into distinct clusters based on a certain measure of 
similarity among the data points. Clustering analysis has many applications 
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in a wide variety of fields including pattern recognition, image processing 
and bioinformatics. In gene expression microarray studies, clustering can- 
cer samples based on their gene expression profile has revealed molecular 
subgroups associated with histopathological categories, drug response and 
patient survival differences [Perou et al. (1999), Alizadeh et al. (2000), Sorlie 
et al. (2001), Lapointe et al. (2003), Hoshida et al. (2003)]. 

In the past few years, integrative genomic studies are emerging at a fast 
pace where in addition to gene expression data, genome-wide data sets cap- 
turing somatic mutation patterns, DNA copy number alterations and DNA 
methylation changes are simultaneously obtained in the same biological sam- 
ples. A fundamental challenge in translating cancer genomic findings into 
clinical application lies in the ability to find "driver" genetic and genomic 
alterations that contribute to tumor initiation, progression and metastasis 
[Chin and Gray (2008), Simon (2010)]. As integrated genomic studies have 
emerged, it has become increasingly clear that true oncogenic mechanisms 
are more visible when combining evidence across patterns of alterations in 
DNA copy number, methylation, gene expression and mutational profiles 
[Cancer Genome Atlas Research Network (2008), TCGA Network (2011)]. 
Integrative analysis of multiple "omic" data types can help the search for 
potential "drivers" by uncovering genomic features that tend to be dys- 
regulated by multiple mechanisms [Chin and Gray (2008)]. A well-known 
example is the HER2 oncogene which can be activated through DNA am- 
plification and mRNA over-expression. We will discuss the HER2 example 
further in our motivating example. 

In this paper, we focus on the class discovery problem given multiple 
omics data sets (multidimensional data) for tumor subtype discovery. A ma- 
jor challenge in subtype discovery based on gene expression microarray data 
is that the clinical and therapeutic implications for most existing molecular 
subtypes of cancer are largely unknown. A confounding factor is that expres- 
sion changes may be related to cellular activities independent of tumorige- 
nesis, and therefore leading to subtypes that may not be directly relevant 
for diagnostic and prognostic purposes. By contrast, as we have shown in 
our previous work [Shen, Olshen and Ladanyi (2009)], a joint analysis of 
multiple omics data types offers a new paradigm to gain additional insights. 
Individually, none of the genomic-wide data type alone can completely cap- 
ture the complexity of the cancer genome or fully explain the underlying 
disease mechanism. Collectively, however, true oncogenic mechanisms may 
emerge as a result of joint analysis of multiple genomic data types. 

Somatic DNA copy number alterations are key characteristics of cancer 
[Beroukhim et al. (2010)]. Copy number gain or amplification may lead to 
activation of oncogenes (e.g., HER2 in Figure 1). Tumor suppressor genes 
can be inactivated by copy number loss. High-resolution array-based com- 
parative genomic hybridization (aCGH) and SNP arrays have become dom- 
inant platforms for generating genome-wide copy number profiles. The mea- 
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surement typical of aCGH platforms is a log-ratio of normalized intensities 
of genomic DNA in experimental versus control samples. For SNP arrays, 
copy number measures are represented by log of total copy number (logR) 
and parent-specific copy number as captured by a B-allele frequency (BAF) 
[Chen, Xing and Zhang (2011), Olshen et al. (2011)]. Both platforms gener- 
ate contiguous copy number measures along ordered chromosomal locations 
(an example is given in Figure 6). Spatial smoothing methods are desirable 
for modeling copy number data. 

In addition to copy number aberrations, there are widespread DNA methy- 
lation changes at CpG dinucleotide sites (regions of DNA where a Cytocine 
nucleotide occurs next to a Guanine nucleotide) in the cancer genome. DNA 
methylation is the most studied epigenetic event in cancer [Holliday (1979), 
Feinberg and Vogelstein (1983), Laird (2003, 2010)]. Tumor suppressor genes 
are frequently inactivated by hypermethylation (increased methylation of 
CpG sites in the promoter region of the gene), and oncogenes can be ac- 
tivated through the promoter hypomethylation. DNA methylation arrays 
measure the intensities of methylated probes relative to unmethylated probes 
for tens of thousands of CpG sites located at promoter regions of protein 
coding genes. M-values are calculated by taking log-ratios of methylated 
and unmethylated probe intensities [Irizarry et al. (2008)], similar to the 
M-values used for gene expression microarrays which quantify the relative 
expression level (abundance of a gene's mRNA transcript) in cancer samples 
compared to a normal control. 

In this paper, we focus on the class discovery problem given multiple omics 
data sets for tumor subtype discovery. Suppose t = l,...,T different genome- 
scale data types (DNA copy number, methylation, mRNA expression, etc.) 
are obtained in j = 1, . . . , n tumor samples. Let Xj be the pt x n data ma- 
trix where Xjt denote the ith row and Xyj the jth column of Xj. Rows are 
genomic features and columns are samples. Here we use the term genomic 
feature and the corresponding feature index i in the equations throughout 
the paper to refer to either a protein-coding gene (typically for expression 
and methylation data) or ordered genomic elements that do not necessarily 
have a one-to-one mapping to a specific gene (copy number measure along 
chromosomal positions) depending on the data type. 

Let Z be a j x n matrix where rows are latent variables and columns 
are samples, and g is the number of latent variables. Latent variables can 
be interpreted as "fundamental" variables that determine the values of the 
original p variables [Jolliffe (2002)]. In our context, we use latent variables to 
represent disease driving factors (underlying the wide spectrum of genomic 
alterations of various types) that determine biologically and clinically rele- 
vant subtypes of the disease. Typically, g <C ^2 t Pt, providing a low-dimension 
latent subspace to the original genomic feature space. Following a similar 
argument for reduced-rank linear discriminant analysis in Hastie, Tibshirani 
and Friedman (2009), a rank-g approximation where g < K — 1 is sufficient 
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for separating K clusters among the n data points. For the rest of the paper, 
we assume the dimension of Z is (K — 1) x n with mean zero and identity 
covariance matrix. A joint latent variable model expressed in matrix form is 

(1) X t = W t Z + E 4 , t = l,...,T. 

In the above, Wj is a pt x (K — 1) coefficient (or loading) matrix relating 
Xt and Z with Wjt being the jth row and w^ the kth column of Wj, and 
Et is a pt x n matrix where the column vectors ej,j = 1, .. . ,n, represent 
uncorrelated error terms that follow a multivariate distribution with mean 
zero and a diagonal covariance matrix ^t = {p\ ,■■■ , &p t ) • Each data matrix 
is row-centered so no intercept term is presented in equation (1). 

Equation (1) provides an effective integration framework in which the la- 
tent variables Z = (zi, . . . ,zk-i) are common for all data types, represent- 
ing a probabilistic low-rank approximation simultaneously to the T original 
data matrices. In Section 3.2 we point out its connection and differences 
from singular value decomposition (SVD). In Sections 6 and 7 we illustrate 
that applying SVD to the combined data matrix broadly fails to achieve an 
effective integration of various data types. 

Equation (1) is the basis of our initial work [Shen, Olshen and Ladanyi 
(2009)] in which we introduced an integrative model called iCluster. We con- 
sidered a soft-thresholding estimate of W[ that continuously shrinks the co- 
efficients for noninformative features toward zero. The motivation for sparse 
coefficient vectors is clearly indicated by Figure 1 panels (D) and (E). A basic 
sparsity-inducing approach is to use a lasso penalty [Tibshirani (1996)]. Nev- 
ertheless, different data types call for appropriate penalty terms such that 
each Wj is sparse with a specific sparsity structure. In particular, copy num- 
ber aberrations tend to occur in contiguous regions along chromosomal po- 
sitions (Figure 6), for which the fused lasso penalty [Tibshirani et al. (2005)] 
is appropriate. In gene expression data where groups of genes involved in the 
same biological pathway are co-regulated and thus highly correlated in their 
expression levels, the elastic net penalty [Zou and Hastie (2005)] is useful 
to encourage a grouping effect by selecting strongly correlated features to- 
gether. In this paper, we present a sparse iCluster framework that employs 
different penalty terms for the estimation of Wt associated with different 
data types. 

In Section 3 we present the methodological details of the latent variable 
regression combined with lasso, elastic net and fused lasso penalty terms. 
To determine the optimal combination of the penalty parameter values, a 
very large search space needs to be covered, which presents a computational 
challenge. An exhaustive grid search is ineffective. We use a uniform design 
by Fang and Wang (1994) that seeks "experimental" points that scattered 
uniformly across the search domain has superior convergence rates over the 
conventional grid search (Section 3.3). Section 4 presents an EM algorithm 
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for maximizing the penalized data log- likelihood. The number of clusters 
K is unknown and must be estimated. Section 5 discusses the estimation 
of K based on a cross-validation approach. Section 6 presents results from 
simulation studies. Section 7 presents results from real data applications. 
In particular, Section 7.1 presents an integrative analysis of epigenomic and 
transcriptomic profiling data using a breast cancer data set [Holm et al. 
(2010)]. In Section 7.2 we illustrate our proposed method to construct a 
genome- wide portrait of copy number induced gene expression changes using 
a lung cancer data set [Chitale et al. (2009)]. We conclude the paper with a 
brief summary in Section 8. 

2. Motivating example. In this section we show an example where an 
integrated analysis of multiple omics data sets is far more insightful than 
separate analyses. Pollack et al. (2002) used customized microarrays to gen- 
erate measurements of DNA copy number and mRNA expression in parallel 
for 37 primary breast cancer and 4 breast cancer cell line samples. Here the 
number of data types T = 2. In the mRNA expression data matrix Xi, the 
individual element x^i refers to the observed expression of the ith gene in 
the j'th tumor. In the DNA copy number data matrix X2, the individual 
element Xij2 refers to the observed log-ratio of tumor versus normal copy 
number of the ith gene in the jth tumor. In this example, both data types 
have gene-centric measurement by design. 

A heatmap of the genomic features on chromosome 17 is plotted in Fig- 
ure 1. In the heatmap, rows are genes ordered by their genomic position 
and columns are samples ordered by hierarchical clustering [panels (A)] or 
by the lasso iCluster method [panels (B)]. There are two main subclasses 
in the 41 samples: the cell line subclass (samples labeled in red) and the 
HER2 tumor subclass (samples labeled in green). It is clear in Figure 1(A) 
that these subclasses cannot be distinguished well from separate hierarchical 
clustering analyses. 

Separate clustering followed by manual integration as depicted in Fig- 
ure 1(A) remains the most frequently applied approach to analyze multiple 
omics data sets in the current literature due to its simplicity and the lack of 
a truly integrative approach. However, Figure 1(A) clearly shows its lack of 
a unified system for cluster assignment and poor correlation of the outcome 
with biological and clinical annotation. As we will illustrate in the simula- 
tion study in Section 7, separate clustering can fail drastically in estimating 
the true number of clusters, classifying samples to the correct clusters and 
selecting cluster-associated features. Several limitations of this common ap- 
proach are responsible for its poor performance: 

• Correlation between data sets is not utilized to inform the clustering anal- 
ysis, ignoring an important piece of information that plays a key role for 
identifying "driver" features of biological importance. 
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FlG. 1. A motivating example using the Pollack data set to demonstrate that a joint anal- 
ysis using the lasso iCluster method outperforms the separate clustering approach in subtype 
analysis given DNA copy number and mRNA expression data. (A) Heatmap with samples 
ordered by separate hierarchical clustering. Rows are genes and samples are columns. Sam- 
ples labeled in red are breast cancer cell line samples. Samples labeled in green are HER2 
breast tumors. (B) Heatmap with samples ordered by integrative clustering using the lasso 
iCluster method. (C) Kaplan-Meier plot indicates the HER2 subtype has poor survival out- 
come. (D) Standard cluster centroid estimates. (E) Sparse coefficient estimates under the 
lasso iCluster model. 



• Separate analysis of paired genomic data sets is an inefficient use of the 
available information. 

• It is not straightforward to integrate the multiple sets of cluster assign- 
ments that are data-type dependent without extensive prior biological 
information. 

• The standard clustering method includes all genomic features regardless 
of their relevance to clustering. 

Our method aims to overcome these obstacles by formulating a joint anal- 
ysis across multiple omics data sets. The heatmap in Figure 1(B) demon- 
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strates the superiority of our working model in correctly identifying the sub- 
groups (vertically divided by solid black lines). From left to right, cluster 1 
(samples labeled in red) corresponds to the breast cancer cell line subgroup, 
distinguishing cell line samples from tumor samples. Cluster 2 corresponds 
to the HER2 tumor subtype (samples labeled in green), showing concor- 
dant amplification in the DNA and overexpression in mRNA at the HER2 
locus (chr 17ql2). This subtype is associated with poor survival as shown in 
Figure 1(C). Cluster 3 (samples labeled in black) did not show any distinct 
patterns, though a pattern may have emerged if there were additional data 
types such as DNA methylation. 

The motivation for sparseness in the coefficient estimates is illustrated by 
Figure 1(E). It clearly reveals the HER2-suhtype specific genes (including 
HER2, GRB7, TOP2A). By contrast, the standard cluster centroid estima- 
tion is flooded with noise [Figure 1(D)], revealing an inherent problem with 
clustering methods without regularization. 

The copy number data example in Figure 1 depicts a narrow (focal) DNA 
amplification event on a single chromosome involving only a few genes (in- 
cluding HER2). Nevertheless, copy number is more frequently altered across 
long contiguous regions. In the lung cancer data example we will discuss in 
Section 7.2, chromosome arm-level copy number gains (log-ratio > 0) and 
losses (log-ratio < 0) as illustrated in Figure 6 are frequently observed, mo- 
tivating the use of a fused lasso penalty to account for such structural de- 
pendencies. In the next section we discuss methodological details on lasso, 
fused lasso and elastic net in the latent variable regression. 

3. Method. Assuming Gaussian error terms, equation (1) implies the 
following conditional distribution 

(2) X t |Z~JV(W t Z,* t ), t = l,...,T. 

Further assuming Z ~ iV(0,I), the marginal distribution for the observed 
data is then 

(3) X t ~JV(0,S t ), 

where 5]^ = W(WJ + *&f Direct maximization of the marginal data log- 
likelihood is difficult. We consider an expectation-maximization (EM) al- 
gorithm [Dempster, Laird and Rubin (1977)]. In the EM framework, the 
latent variables are considered "missing data." Therefore, the "complete" 
data log-likelihood that consists of these latent variables is 

T T 

■ J>g l*tl - - J>((Xt - W t Z)'* t - 1 (X 4 - W t Z)) 



n 



(4) 



2^ ol ' 2 

4=1 4=1 

-itr(Z'Z). 
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The constant term in l c has been omitted. In the next section we discuss a 
penalized complete data log-likelihood to induce sparsity in Wj. 

3.1. Penalized likelihood approach. As mentioned earlier, sparsity in Wt 
directly impacts the interpretability of the latent variables. A zero entry in 
the ith row and kth column (wn^t = 0) means that the ith genomic feature 
has no weight on the feth latent variable in data type t. If the entire row 
Wn = 0, then this genomic feature has no contribution to the latent variables 
and is considered noninformative. We use a penalized complete-data log- 
likelihood as follows to enforce desired sparsity in the estimated Wj: 

T 

(5) 4, P ({W t }f =1 , {*}f =1 ) =£ C -J2 j a, (W 4 ), 

4=1 

where £ c is the complete-data log-likelihood function defined in (4) which 
controls the fitness of the model; J\ t (Wt) is a penalty function which con- 
trols the complexity of the model; and Aj is a nonnegative tuning parameter 
that determines the balance between the two. The subscript p in £ CjP stands 
for penalized. 

Different data types call for different penalty functions. We introduce 
three types of penalties in the iCluster model: lasso, elastic net, and fused 
lasso. Both lasso and elastic net regression methods have been applied to 
gene expression data [Zhao and Simon (2010), Barretina et al. (2012)]. For 
feature selection, the elastic net may have an additional advantage by shrink- 
ing coefficients of correlated features toward each other, and thus encourages 
a grouping effect toward selecting highly correlated features together. Copy 
number aberrations tend to occur in contiguous regions along chromosomal 
positions, motivating the use of fused lasso. 

3.1.1. The lasso penalty. The lasso penalty is a basic sparsity-inducing 
that takes the form 

K-\ Pt 

(6) J\ t (Wt) = h J £ / J2\wikt\, 

k=\ i=l 

where Wikt is the element in the ith row and /cth column of W{. The l\- 
penalty continuously shrinks the coefficients toward zero and thereby yields 
a substantial decrease in the variance of the coefficient estimates. Owing 
to the singularity of the ^i-penalty at the origin (wikt = 0), some estimated 
Wikt will be exactly zero. The degree of sparseness is controlled by the tuning 
parameter At. 

3.1.2. The fused lasso penalty. To account for the strong spatial depen- 
dence along genomic ordering typical in DNA copy number data, we consider 
the fused lasso penalty [Tibshirani et al. (2005)], which takes the following 
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form: 

K-l pt K-\ pt 

( 7 ) J \t( W t) = Alt Yl ^2 \ Wikt \ + A2 * J2 ^2 \ Wikt ~ W (i-l)kt\, 

k=l i=l fe=l i=2 

where Ait and A2t are two nonnegative tuning parameters. The first penalty 
encourages sparseness while the second encourages smoothness along index i. 
The fused lasso penalty is particularly suitable for DNA copy number data 
where contiguous regions of a chromosome tend to be altered in the same 
fashion [Tibshirani and Wang (2008)]. 

3.1.3. The elastic net penalty. The elastic net penalty [Zou and Hastie 
(2005)] takes the form 

A'— 1 pt K—\ pt 

(8) j At (w t ) = Au y, J2 Kfc*i + ^ E £«&*' 

fe=l i=\ fc=l i=l 

where Ait and A2t are two nonnegative tuning parameters. Zou and Hastie 
(2005) showed that the elastic net penalty tends to select or remove highly 
correlated predictors together in a linear regression setting by enforcing their 
estimated coefficients to be similar. In our experience, the elastic net penalty 
tends to be more numerically stable than the lasso penalty in our model. 

Figure 2 shows the effectiveness of sparse iCluster using a simulated pair of 
data sets (T = 2). We simulated a single length-n latent variable z ~ iV(0, 1) 
where n = 100. The coefficient matrix Wi consists of a single column wi 
of length p\ = 200 with the first 20 elements set to 1.5 and the remaining 
elements set to 0, that is, wn = 1.5 for i = 1, ... ,20 and elsewhere. The 
coefficient matrix W2 consists of a single column W2 of length pi = 200 
and set to have Wi% = 1.5 for i = 101, . . . , 120 and elsewhere. The lasso, 
elastic net and fused lasso coefficient estimates are plotted to contrast the 
noisy cluster centroids estimated separately in data type 1 (left) and in data 
type 2 (right) in the top panel of Figure 2. The algorithm for computing 
these sparse estimates will be discussed in Section 4. 

3.2. Relationship to singular value decomposition (SVD). An SVD/PCA 
on the concatenated data matrix X = (X' x , . . . , X^)' is a special case of equa- 
tion (1) that requires a common covariance matrix across data types. Specif- 
ically, it can be shown that when *i = ■ • ■ = \I> T = <r 2 I, equation (1) reduces 
to a "probabilistic SVD/PCA" on the concatenated data matrix X. Follow- 
ing similar derivation in Tipping and Bishop (1999), the maximum likelihood 
estimates of W, where W = (W^, . . . , W^)' is the concatenated coefficient 
matrix, coincide with the first K — 1 eigenvectors of the sample covariance 
matrix XX' or the right singular vector of the concatenated data matrix X. 
The MLE of a 2 is the average of the remaining n — K + 1 eigenvalues, cap- 
turing the residual variation averaged over the "lost" dimensions. 



10 R. SHEN, S. WANG AND Q. MO 

Cluster Centroids Cluster Centroids 



Jill 



A* 



^Jj^l^^j^^^jliM^ ^ji^v.g^j^,./,. °^ yg i ,^.. ., i,. ji y 1^^ v - n^^i^ Vi^-HE 



Lasso iCIuster Lasso iCIuster 

|L -ii . I, 



Elastic net ICIuster Elastic net iCIuster 

I -1 ■ 



Fused Lasso iCIuster Fused Lasso iCIuster 

I 11 l_ 



Fig. 2. A simulated pair of data sets each with 100 subjects (n — 100,) and 200 features 
(pt = 200, t — 1,2,), and 2 subgroups (K — 2). Top panel plots the cluster centroids in data 
set 1 (left) and in data set 2 (right). Estimated sparse iCIuster coefficients are plotted 
below. 



The major assumption is the requirement that all features have the same 
variance. The genomic data types, however, are fundamentally different and 
the method we propose primarily aims to deal with heteroscedasticity among 
genomic features of various types. The common covariance assumption that 
leads to SVD is therefore not suitable for integrating omics data types. It is 
worth mentioning that feature scaling may not necessarily yield a\ = ■ ■ ■ = 
Op t . In our modeling framework, af is the conditional variance of Xij given Zj . 
Standardization on x^j will yield the same marginal variance across features, 
but the conditional variances of features are not necessarily the same after 
standardization. 

Our method aims to identify common influences across data types through 
the latent component Z. The independent error terms Ej,t = 1, . . . ,T cap- 
ture the remaining variances unique to each data type after accounting for 
the common variance. In SVD, however, the unique variances are absorbed 
in the term WZ by enforcing \l/i = • • • = ^t = <7 2 I- As a result, common 
and unique variations are no longer separable. This is in fact one of the 
fundamental differences between the factor analysis model and PCA, which 
has practical importance in integrative modeling. 
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In Sections 6 and 7 we illustrate that SVD on the concatenated data 
matrix broadly fails to achieve an effective integration in both simulated 
and real data sets. By contrast, our method can more effectively deal with 
heteroscedasticity among genomic features of various types. The contrast 
with a sparse SVD method lies in the fact that our framework allows block- 
wise sparse constraints to the coefficient matrix. 

3.3. Uniform sampling. An exhaustive grid search for the optimal com- 
bination of the penalty parameters that maximizes a certain criterion (the 
optimization criterion will be discussed in Section 5) is inefficient and compu- 
tationally prohibitive. We use the uniform design (UD) of Fang and Wang 
(1994) to generate good lattice points from the search domain, a similar 
strategy adopted by Wang et al. (2008). A key theoretical advantage of UD 
over the traditional grid search is the uniform space filling property that 
avoids wasteful computation at close-by points. Let D be the search re- 
gion. Using the concept of discrepancy that measures uniformity on D C R d 
with arbitrary dimension d, which is basically the Kolmogorov statistic for 
a uniform distribution on D, Fang and Wang (1994) point out that the dis- 
crepancy of the good lattice point set from a uniform design converges to 
zero with a rate of 0(n~ 1 (logn) ), where n (a prime number) denotes the 
number of generated points on D. They also point out that the sequence of 
equi-lattice points on D has a rate of 0{n~ 1 ' d ) and the sequence of uniformly 
distributed random numbers on D has a rate of 0(n~ 1 ' 2 (loglog?i) 1 ' 2 ). Thus, 
the uniform design has an optimal rate for d>2. 

4. Algorithm. We now discuss the details of our algorithm for param- 
eter estimation in sparse iCluster. The latent variables (columns of Z) are 
considered to be "missing" data. The algorithm therefore iterates between 
an E-step for imputing Z and a penalized maximization step (M-step) that 
updates the estimates of Wj and *&t for all t. Given the latent variables, the 
data types are conditionally independent and, thus, the integrative omics 
problem can be decomposed into solving T independent subproblems with 
suitable penalty terms. The penalized estimation procedures are therefore 
"decoupled" for each data type given the latent variables Z. When conver- 
gence is reached, cluster membership will be assigned for each tumor based 
on the posterior mean of the latent variable Z. 

E-step. In the E-step, we take the expectation of the penalized complete- 
data log-likelihood £ CjP as defined in equations (4) and (5), which primar- 
ily involves computing two conditional expectations given the current 
parameter estimates: 

(9) E[Z\X]=W''S- 1 X 

(10) E[ZZ'\X] = I - W'S _1 W + E[Z\X]E[Z\X]' , 
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where S = WW' + \I/ and \I/ = diag(\l/i, . . . , VPt)- Here, the posterior 
mean in (9) effectively provides a simultaneous rank-(.K" — 1) approxima- 
tion to the original data matrices X. 
M-step. In the M-step, given the quantities in equations (9) and (10), we 
maximize the penalized complete-data log-likelihood to update the esti- 
mates of W[ and ^f 

(1) Sparse estimates of W 4 . For t = 1,...,T, we obtain the penalized 
estimates by 

1 T 
W t <- argmin - ^E[tr((X t - WtZ/tf^Xt - W t Z))|W t , 4f t ] 



(11] 



w < 2 t=i 



+ J\ t (Wt 



w ikt 



where W( and \P( denote the parameter estimates in the last EM iter- 
ation. We apply a local quadratic approximation [Fan and Li (2001)] 
to the l\ term involved in the penalty function J\ t (Wt). Using the 
fact | a | = a 2 /|a| when a^O, we consider the following quadratic 
approximation to the i\ term: 

K-\ Pt 2 

(i2) A< e s ,;«, 

fc=i i=i 

Due to the uncorrelated error terms (diagonal VS^) and "noncou- 
pling" structure of the lasso and elastic net penalty terms, the es- 
timation of W[ can then be computed feature-by-feature by taking 
derivatives with respect to each row Wn for i = 1, . . . ,p t . The solu- 
tion for (11) under various penalty terms can then be obtained by 
iteratively computing the following ridge regression estimates: 

(a) Lasso estimates. For i = l,...,pt, 

(13) w«=x it S[Z'|X t ,Wt,* t ](^[ZZ / |X t ,Wt,*t]+A i )- 1 , 

where A* = 2af\ t diag{l/|«>at|, ■ . . , l/K(#-i)tl}- Computing (13) 
only requires the inversion of a {K — 1) x (K — 1) matrix in the 
latent subspace. 

(b) Elastic net estimates. Similarly, we consider a quadratic approx- 
imation to the l\ term in the elastic net penalty and obtain the 
solution for (11) by iteratively computing a ridge regression esti- 
mate similar to (13) but with Aj = 2oj (Aitdiag{l/|wiit|, • • • > 1/ 

\Wi( K -l)t\} + >>2tl)- 

(c) Fused lasso estimates. For fused lasso penalty terms, we consider 
the following approximation: 

K—\ p o K—X pi \2 

(14) A lt VV^ + A a VV ^4 . 



(15) 
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In the fused lasso scenario, the parameters are coupled together 
and the estimation of Wj are no longer separable. However, we 
circumvent the problem by expressing the estimating equation 
in terms of a vectorized form wj = vec(W^) = (wj, . . . , w^-i)', 
a column vector of dimension s = pt • (K — 1) by concatenating 
the columns of Wj. Then (14) can be expressed in the following 
form: 

AhWjAwj + A 2 tw[Lw(, 

where 

A = diag{l/|u/i|,...,l/|tt> s |}, 

L = D-M, 

M = < 3 (s x s dimension), 

[ 0, otherwise 

D = diagjcii, . . . , d s } where dj is the summation of the jth row of M. 

Letting C = X t £[Z'|X t ,W t ,# t ] and Q = E[ZZ'\X U W t ,# t ], 
the corresponding estimating equation is then 



where 
(16) Q 



ViQ 



Aj(w )+ Qw: 


= c, 
c = 


/afc[ 


<Q/ 




\°p t 2c P t 



where Cj is the jth row of C. The solution for (11) under the 
fused lasso penalty is then computed by iteratively computing 

(17) w i = (Q + 2A lt A + 2A 2t L)- 1 C. 

(2) Estimates of *&f Finally, for t = 1, . . . , T, we update ^t in the M-step 
as follows: 

(18) ^ t = -di ag (X t X' t -W t E[Z\{X t }f =1 ,{W t }f =1 ,{i' t }f =1 ]X' t ). 

n 

The algorithm iterates between the E-step and the M-step as described 
above until convergence. Cluster membership will then be assigned by ap- 
plying a standard K-means clustering on the posterior mean J5[Z|X]. In 
other words, cluster partition in the final step is performed in the integrated 
latent variable subspace of dimension n x {K — 1). Applying /c-means on 
latent variables to obtain discrete cluster assignment is commonly used in 
spectral clustering methods [Ng, Jordan and Weiss (2002), Rohe, Chatterjee 
and Yu (2010)]. 
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5. Choice of tuning parameters. We use a resampling-based criterion for 
selecting the penalty parameters and the number of clusters. The procedure 
entails repeatedly partitioning the data set into a learning and a test set. In 
each iteration, sparse iCluster (for a given K and tuning parameter values) 
will be applied to the learning set to obtain a classifier and subsequently 
predict the cluster membership for the test set samples. In particular, we 
first obtain parameter estimates from the learning set. For new observations 
in the test data X*, we then compute the posterior mean of the latent 
variables £7[Z|X*] = W^S^ X*, where W^, S^ denote parameter estimates 
from the learning set. A K-means clustering is then applied to £7[Z|X*] to 
partition the test set samples into K clusters. Denote this as partition C\. 
In parallel, the procedure applies an independent sparse iCluster with the 
same penalty parameter values to the test set to obtain a second partition C2, 
giving the "observed" test sample cluster labels. Under the true model, the 
predicted C\ and the "observed" C2 (regarded as the "truth") would have 
good agreement by measures such as the adjusted Rand index. We therefore 
define a reproducibility index (RI) as the median adjusted Rand index across 
all repetitions. Values of RI close to 1 indicate perfect cluster reproducibility 
and values of RI close to indicate poor cluster reproducibility. In this 
framework, the concepts of bias, variance and prediction error that typically 
apply to classification analysis where the true cluster labels are known now 
become relevant for clustering. The idea is similar to the "Clest" method 
proposed by Dudoit and Fridlyand (2002), the prediction strength measure 
proposed by Tibshirani and Walther (2005) and the in-group proportion 
(IGP) proposed by Kapp and Tibshirani (2007). 

6. Simulation. In this section we present results from two simulation 
studies. In the first simulation setup, we simulate a single length- n latent 
variable z ~ iV(0, 1) where n = 100. Subject j, j = 1, . . . , n, belongs to cluster 
1 if Zj > and cluster 2 otherwise. For simplicity, the pair of coefficient 
matrices (Wi, W2) are of the same dimension 200 x 1 (p\ =p 2 = 200), with 
wu = 3 for i = 1, . . . , 20 for both data types (t = 1, 2) and zero elsewhere. 
Next we obtain the data matrices (Xi,X2) with each element generated 
according to equation (1) with standard normal error terms. This simulation 
represents a scenario where an effective joint analysis of two data sets should 
be expected to enhance the signal strength and thus improve clustering 
performance. 

Table 1 summarizes the performances of each method in terms of the abil- 
ity to choose the correct number of clusters, cross-validated error rates and 
cluster reproducibility. In Table 1 separate K-means methods perform poorly 
in terms of the ability to choose the correct number of clusters, cluster repro- 
ducibility and the cross-validation error rates (with respect to the true sim- 
ulated cluster membership). K-means on concatenated data performs even 
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Table 1 

Clustering performance summarized over 50 simulated data sets under 

setup 1 (K = 2). Separate clustering methods have two sets of numbers 

associated with model fit to each individual data type. Numbers in parentheses 

are the standard deviations over 50 simulations 





Percent of 








times choosing 


Cross-validation 


Cluster 


Method 


the correct K 


error rate 


reproducibility 


Separate K-means 


58 


0.08 (0.04) 


0.67 (0.17) 




62 


0.08 (0.04) 


0.70 (0.19) 


Concatenated K-means 


50 


0.06 (0.04) 


0.66 (0.19) 


Separate sparse SVD 


74 


0.07 (0.06) 


0.71 (0.13) 




76 


0.07 (0.07) 


0.72 (0.12) 


Concatenated sparse SVD 


78 


0.07 (0.08) 


0.70 (0.12) 


Separate AHP-GMM 


38 


0.06 (0.04) 


0.72 (0.15) 




40 


0.05 (0.04) 


0.74 (0.14) 


Concatenated AHP-GMM 


46 


0.06 (0.04) 


0.75 (0.13) 


Lasso iCluster 


90 


0.04 (0.02) 


0.81 (0.08) 


Enet iCluster 


94 


0.03 (0.02) 


0.85 (0.07) 


Fused lasso iCluster 


94 


0.03 (0.02) 


0.83 (0.08) 



worse, likely due to noise accumulation. For sparse SVD, a cluster assign- 
ment step is needed. We took a similar approach of applying K-means on the 
first K — 1 right singular vectors of the data matrix. Sparse SVD performs 
better than simple K-means, though data concatenation does not seem to 
offer much advantage. In this simulation scenario, AHP-GMM models show 
good performance in feature selection (Table 2), but appear to have a low 



Table 2 

Feature selection performance summarized over 50 simulated data sets for K = 2. There 

are a total of 20 true features simulated to distinguish the two sample clusters 





Data 1 


Data 2 




True 


False 


True 


False 


Method 


positives 


positives 


positives 


positives 


Separate K-means 


- 


- 


- 


- 


Concatenated K-means 


- 


- 


- 


- 


Separate sparse SVD 


18.7 (3.2) 


21.5 (37.7) 


18.8 (2.9) 


27.4 (43.6) 


Concatenated sparse SVD 


14.0 (5.3) 


22.5 (16.1) 


13.7 (5.2) 


22.8 (16.4) 


Separate AHP-GMM 


19.6 (2.1) 


0.02 (0.16) 


19.1 (3.1) 


0(0) 


Concatenated AHP-GMM 


18.8 (3.6) 


0.02 (0.15) 


18.6 (4.0) 


0.02 (0.15) 


Lasso iCluster 


20 (0) 


0.07 (0.3) 


20 (0) 


0.07 (0.3) 


Enet iCluster 


20 (0) 


0.1 (0.3) 


20 (0) 


0.02 (0.1) 


Fused lasso iCluster 


20 (0) 


0(0) 


20 (0) 


0(0) 
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Table 3 
Clustering performance summarized over 50 simulated data sets under setup 2 (K = 3) 





Frequency of 








choosing the 


Cross-validation 


Cluster 


Method 


correct K 


error rate 


reproducibility 


Separate K-means 


2 


0.33 (0.001) 


0.54 (0.07) 







0.33 (0.002) 


0.47 (0.04) 


Concatenated K-means 


100 


0.01 (0.07) 


0.96 (0.03) 


Separate sparse SVD 





0.28 (0.10) 


0.45 (0.03) 







0.31 (0.07) 


0.44 (0.04) 


Concatenated sparse SVD 


16 


0.01 (0.002) 


0.59 (0.05) 


Separate AHP-GMM 





0.07 (0.13) 


0.63 (0.05) 







0.32 (0.02) 


0.54 (0.06) 


Concatenated AHP-GMM 


100 


0.01 (0.07) 


0.98 (0.03) 


Lasso iCluster 


100 


0.0003 (0.001) 


0.98 (0.01) 


Enet iCluster 


100 


0.0003 (0.001) 


0.97 (0.02) 


Fused lasso iCluster 


100 


0(0) 


0.94 (0.05) 



frequency of choosing the correct K = 2. A common theme in this simulation 
is that a data concatenation approach is generally ineffective regardless of 
the clustering methods used. By contrast, sparse iCluster methods achieved 
an effective integrative outcome across all performance criteria. 

Table 2 summarizes the associated feature selection performance. No num- 
bers are shown for the standard K-means methods, as they do not have 
an inherent feature selection method. Among the methods, sparse iClus- 
ter methods perform the best in identifying the true positive features while 
keeping the number of false positives close to 0. 

In the second simulation, we vary the setup as follows. We simulate 
150 subjects belonging to three clusters (K = 3). Subjects j = 1, ... ,50 be- 
long to cluster 1, subjects j = 51, . . . , 100 belong to cluster 2, and subjects 
j = 101, . . . , 150 belong to cluster 3. A total of T = 2 data types (Xi, X2) 
are simulated. Each has pi = P2 = 500 features. Here each data type alone 
only defines two clusters out of the three. In data set 1, xij\ ~ 7V(2,1) for 
i = 1, . . . , 10 and j = 1, . . . , 50, x ijX ~ 7V(1.5, 1) for % = 491, . . . , 500 and j = 
51, ... , 100, and Xij\ ~ iV(0, 1) for the rest. In data set 2, Xij2 = 0.5 * x; L j\ + e 
where e ~ iV(0, 1) for i = 1, . . . , 10 and j = 1, . . . , 50, x ij2 ~ N(2, 1) for i = 
491, . . . , 500 and j = 101, . . . , 150, and x ij2 ~ N(0, 1) for the rest. The first 
10 features are correlated between the two data types. In Tables 3 and 4, 
the sparse iCluster methods consistently outperform the other methods in 
clustering and feature selection. 

The core iCluster EM iterations are implemented in C. Table 5 shows 
some typical computation times for problems of various dimensions on a 
3.2 GHz Xeon Linux computer. 
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Table 4 
Feature selection performance summarized over 50 simulated data sets under K = 3 







Data 1 


: 


Data 2 




True 


False 


True 


False 


Method 


positives 


positives 


positives 


positives 


Separate K-means 


- 


- 


- 


- 


Concatenated K-means 


- 


- 


- 


- 


Separate sparse SVD 


19.8 (0.7) 


349.6 (167.1) 


19.9 (0.3) 


347.5 (142.5) 


Concatenated sparse SVD 


20 (0) 


396.6 (128.7) 


19.6 (1.6) 


395.4 (128.3) 


Separate AHP-GMM 


15.8 (5.0) 


239.9 (245.5) 


15.5 (5.5) 


269.9 (246) 


Concatenated AHP-GMM 


19.2 (1.7) 


0.33 (0.64) 


14.4 (4.0) 


0.21 (0.66) 


Lasso iCluster 


20 (0) 


1.5 (1.4) 


19.9 (0.2) 


1.9 (1.5) 


Enet iCluster 


20 (0) 


0.5 (0.6) 


19.8 (0.5) 


0.7 (1.0) 


Fused lasso iCluster 


20 (0) 


0(0) 


20 (0) 


0(0) 



7. Results. In this section we present details of two real data applications. 

7.1. Integration of epigenomic and trans criptomic profiling data in the 
Holm breast cancer study. In Section 2 we discussed a motivating example 
using the Pollack et al. (2002) data set. In this section we present our first 
real data application which involves integrative analysis of DNA methyla- 
tion and gene expression data from the Holm et al. (2010) study. In this 
data set, methylation profiling in 189 breast cancer samples using Illumina 
methylation arrays for 1452 CpG sites (corresponding to 803 cancer-related 
genes) is available. The original study performed a hierarchical clustering 
on the methylation data alone. Through manual integration, the authors 
then correlated the methylation status with gene expression levels for 511 
oligonucleotide probes for genes with CpG sites on the methylation assays in 
the same sample set. Here we compare clustering of individual data types to 
various integration approaches. We included the most variable 288 CpG sites 

Table 5 
Computing time (in seconds) for typical runs of sparse iCluster under various dimensions 

Time (in seconds) 

p N Lasso iCluster Elastic net iCluster Fused lasso iCluster 

0.37 

3.56 

25.05 

76.40 

33 (min) 



200 


100 


0.10 


0.11 


500 


100 


0.50 


0.36 


1000 


100 


1.40 


1.45 


2000 


100 


6.49 


5.90 


5000 


100 


18.93 


18.94 
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Fig. 3. Separation of the data points by (A) latent variables from sparse iCluster, 
(B) right singular vectors from SVD of the methylation data alone, (C) right singular- 
vectors from SVD of the expression data alone, (D) SVD on the concatenated data matrix, 
and (E) sparse SVD on the concatenated data matrix. Red dots indicate samples belonging 
to cluster 1, blue open triangles indicate samples belonging to cluster 2, and orange pluses 
indicate samples belonging to cluster 3. 



(following a similar procedure taken in the Holm study) in the methylation 
data. 

We applied sparse iCluster for a joint analysis of the methylation {p\ = 
288) and gene expression (p2 = 511) data using different penalty combina- 
tions. In Figure 3(A) the first two latent variables separated the samples into 
three distinct clusters. By associating the cluster membership with clinical 
variables, it becomes clear that tumors in cluster 1 are predominantly estro- 
gen receptor (ER)-negative and associated with the basal-like breast cancer 
subtype (Figure 4). Among the rest of the samples, sparse iCluster further 



Cluster 1 



Basal-like 
ER status, 

BRCAitnnm 



PDGFRB 1 
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Fig. 4. Integrative clustering of the Holm study DNA methylation and gene expression 
data revealed three clusters with a cross-validated reproducibility of 0. 7 and distinct clinical 
and molecular characteristics. 

identifies a subclass (cluster 3) that highly expresses platelet-derived growth 
factor receptors (PDGFRA/B), which have been associated with breast can- 
cer progression [Carvalho et al. (2005)]. 

In Section 3.2 we discussed an SVD approach on a combined data matrix 
as a special case of our model. Here we present results from SVD and a sparse 
SVD algorithm proposed by Witten, Tibshirani and Hastie (2009) on the 
concatenated data matrix. Figures 3(B) and 3(C) indicate that SVD applied 
to each data type alone can only separate one out of the three clusters. 
Figures 3(D) and 3(E) indicate that data concatenation does not perform 
any better in this analysis than separate analyses of each data type alone. 

In Table 6 the results from sparse iCluster with two different sets of 
penalty combinations are presented: the combination of (lasso, lasso) and 



Table 6 

Cluster reproducibility and number of genomic features selected using 

sparse iCluster, sparse SVD on concatenated data matrix and Adaptive Hierarchically 

Penalized Gaussian Mixture Model (AHP-GMM) on concatenated data matrix. Two 

variations of the sparse iCluster method were presented: iCluster (lasso, lasso) 

implements lasso penalty for both data types, and iCluster (lasso, elastic net) implements 

lasso penalty for the methylation data and elastic net penalty for the gene expression 

data. K: the number of clusters. RI: reproducibility index 







Selected 


Selected 




Selected 


Selected 






methylation 


expression 




methylation 


expression 


K 


RI 


features 


features 


RI 


features 


features 






iCluster (lasso, 


lasso) 




iCluster (lasso, elastic 


net) 


2 


0.68 


138 


151 


0.70 


183 


353 


3 


0.46 


150 


204 


0.70 


273 


182 


4 


0.42 


183 


398 


0.48 


273 


182 


5 


0.42 


205 


454 


0.47 


282 


223 






sparse SVD 




AHP-GMM 




2 


0.78 


1 


105 


0.93 


9 


6:5 


3 


0.34 


1 


134 


0.42 


28 


105 


4 


0.27 


288 


511 


0.49 


116 


368 


5 


0.22 


273 


504 


0.43 


42 


243 
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Fig. 5. Integrative clustering of the Holm study DNA methylation and gene expression 
data revealed three clusters with a cross-validated reproducibility of 0. 7. Selected genes with 
negatively correlated methylation and expression changes are indicated to the left of the 
heatmap. 

the combination of (lasso, elastic net) for methylation and gene expression 
data, respectively (Table 6 top panel). The reproducibility index (RI) is 
computed for various K 7 s and penalty parameters are sampled based on 
a uniform design described in Section 3.3. As described in Section 5, RI 
(ranges between and 1) measures the agreement between the predicted 
cluster membership and the "observed" cluster membership using a 10-fold 
cross-validation. 

Both methods identified a 2-cluster solution with an RI around 0.70, dis- 
tinguishing the ER- negative, Basal-like subtype from the rest of the tumor 
samples (Figures 3 and 4, samples labeled in red). The iCluster (lasso, elas- 
tic net) method adds an £2 penalty term to encourage grouped selection 
of highly correlated genes in the expression data. This approach further 
identified a 3-cluster solution with high reproducibility (RI = 0.70). The 
additional division finds a subgroup that highly expresses platelet-derived 
growth factor receptors (Figure 4). 

Figure 5 displays heatmaps of the methylation and expression data. Col- 
umns are samples ordered by the integrated cluster assignment. Rows are 
cluster-discriminating genes (with nonzero coefficient estimates) grouped 
into gene clusters by hierarchical clustering. In total, there are 273 differ- 
entially methylated genes and 182 differentially expressed genes. Several 
cancer genes including MUC1, SERPINA5, RARA, MECP2 and RAD50 
are hypermethylated and show concordant underexpression in cluster 1. On 
the other hand, hypomethylation of cancer genes including ETS1, HDAC1, 
FANCE, RAB32 and JAK3 are observed and, correspondingly, these genes 
show increased expression levels. 
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Fig. 6. Illustration of copy number probe-level data from a lung tumor sample [Chitale 
et al. (2009)]. Log-ratios of copy number (tumor versus normal) on chromosomes 3 and 
8 are displayed. Log-ratio greater than zero indicates copy number gain and log-ratio be- 
low zero indicates loss. Black line indicates the segmented value using the circular binary 
segmentation method [Olshen et al. (2004), Venkatraman and Olshen (2007)]. 



To compare with other methods, we implemented the sparse SVD method 
by Witten, Tibshirani and Hastie (2009) and an adaptive hierarchical pe- 
nalized Gaussian mixture model (AHP-GMM) by Wang and Zhu (2008) on 
the concatenated data matrix. None of these methods generated additional 
insights beyond separating the ER-negative and basal-like tumors from the 
others (Figure 3 and Table 6). Feature selection is predominantly "biased" 
toward gene expression features when directly applying sparse SVD on the 
combined data matrix (bottom panel of Table 6), likely due to the larger 
between-cluster variances observed in the gene expression data. 

7.2. Constructing a genome-wide portrait of concordant copy number and 
gene expression pattern in a lung cancer data set. We applied the proposed 
method to integrate DNA copy number (aCGH data) and mRNA expression 
data in a set of 193 lung adenocarcinoma samples [Chitale et al. (2009)]. 
Figure 6 displays an example of the probe-level data (log-ratios of tumor 
versus normal copy number) on chromosomes 3 and 8 in one tumor sample. 
Many samples in this data set display similar chr 3p whole-arm loss and chr 
3q whole-arm gain. 

Arm-length copy number aberrations are surprisingly common in cancer 
[Beroukhim et al. (2010)], affecting up to thousands of genes within the 
region of alteration. A broader challenge is thus to pinpoint the "driver" 
genes that have functional roles in tumor development from those that are 
functionally neutral ("passengers"). To that end, an integrative analysis with 
gene expression data could provide additional insights. Genes that show 
concordant copy number and transcriptional activities are more likely to 
have functional roles. 

In the search for copy number-associated gene expression patterns, we fit 
a sparse iCluster model for each of the 22 chromosomes using (fused lasso, 
lasso) a penalty combination for joint analysis of copy number and gene 
expression data. To facilitate comparison, we compute a 2-cluster solution 
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Fig. 7. Penalized coefficient vector estimates arranged by chromosomes 1 to 22 derived 
by iCluster (fused lasso, lasso) applied to the Chitale et al. lung cancer data set. A single 
latent variable vector is used to identify the major pattern of each chromosome. 



with a single latent variable vector z (instead of estimating K ) to extract 
the major pattern for each chromosome. Penalty parameter tuning is per- 
formed as described before. In Figure 7 we plot the 22 pairs of the sparse 
coefficient vectors ordered by chromosomal position. The coefficients can be 
interpreted as the difference between the two cluster means. Positive and 
negative coefficient values in Figure 7(A) thus indicate copy number gains 
and losses in one cluster relative to the other. Similarly, in Figure 7(B), 
coefficient signs indicate over- or under-expression in one cluster relative to 
the other. Concordant copy number and gene expression changes can thus 
be directly visualized from Figure 7. 

Several chromosomes (1, 3, 8, 10, 15 and 16) show contiguous regions of 
gains or losses spanning whole chromosome arms. As discussed before, arm- 
length aberrations can affect up to thousands of genes within the region 
of alteration. A great challenge is thus to pinpoint the "driver" genes that 
have important roles in tumor development from those that are functionally 
neutral ("passengers"). To that end, an integrative analysis could provide 
additional insights for identifying potential drivers by revealing genes with 
concordant copy number and transcriptional activities. Figure 7 shows that 
the application of the proposed method can unveil a genome-wide pattern 
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of such concordant changes, providing a rapid way for identifying candidate 
genes of biological significance. Several arm-level copy number alterations 
(chromosomes 3, 8, 10, 16) exhibit concerted influence on the expression of 
a small subset of the genes within the broad regions of gains and losses. 

8. Discussion. Integrative genomics is a new area of research accelerated 
by large-scale cancer genome efforts including the Cancer Genome Atlas 
Project. New integrative analysis methods are emerging in this field, van 
Wieringen and van de Wiel (2009) proposed a nonparametric testing pro- 
cedure for DNA copy number induced differential mRNA gene expression. 
Peng et al. (2010) and Vaske et al. (2010) considered pathway and network 
analysis using multiple genomic data sources. A number of others [Waai- 
jenborg, Verselewel de Witt Hamer and Zwinderman (2008), Parkhomenko, 
Tritchler and Beyene (2009), Le Cao, Martin and Robert-Granie (2009), 
Witten, Tibshirani and Hastie (2009), Witten and Tibshirani (2009), Sone- 
son et al. (2010)] suggested using canonical correlation analysis (CCA) to 
quantify the correlation between two data sets (e.g., gene expression and 
copy number data). Most of this previous work focused on integrating copy 
number and gene expression data, and none of these methods were specifi- 
cally designed for tumor subtype analysis. 

We have formulated a penalized latent variable model for integrating mul- 
tiple genomic data sources. The latent variables can be interpreted as a set of 
distinct underlying cancer driving factors that explain the molecular pheno- 
type manifested in the vast landscape of alterations in the cancer genome, 
epigenome and transcriptome. Lasso, elastic net and fused lasso penalty 
terms are used to induce sparsity in the feature space. We derived an effi- 
cient and unified algorithm. The implementation scales well for increasing 
data dimension. 

A future extension on group-structured penalty terms is to incorporate 
a grouping structure defined a priori. Two types of group structures are 
relevant for our application. One is to treat the wn, . . . , w^—i) as a group 
since they are associated with the same feature. Yuan and Lin's group lasso 
penalty [Yuan and Lin (2006)] can be applied directly. Similar to our current 
algorithm, by using Fan and Li's local quadratic approximation, the problem 
reduces to a ridge-type regression in each iteration. The other extension is 
to incorporate the grouping structure among features to boost the signal to 
noise ratio, for example, to treat the genes within a pathway as a group. 
We can consider a hierarchical lasso penalty [Wang et al. (2009)] to achieve 
sparsity at both the group level and the individual variable level. 
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